Aggregating Frame-Level Information in the Spectral Domain With Self-Attention for Speaker Embedding
Abstract
Most pooling methods in state-of-the-art speaker embedding networks are implemented in the temporal domain. However, due to the high non-stationarity in the feature maps produced from the last frame-level layer, it is not advantageous to use the global statistics (e.g., means and standard deviations) of the temporal feature maps as aggregated embeddings. This motivates us to explore stationary spectral representations and perform aggregation in the spectral domain. In this paper, we propose attentive short-time spectral pooling (attentive STSP) from a Fourier perspective to exploit the local stationarity of the feature maps. In attentive STSP, for each utterance, we compute the spectral representations through a weighted average of the windowed segments within each spectrogram by attention weights and aggregate their lowest spectral components to form the speaker embedding. Because most of the feature map energy is concentrated in the low-frequency region of the spectral domain, attentive STSP facilitates the information aggregation by retaining the low spectral components only. Attentive STSP is shown to consistently outperform attentive pooling on VoxCeleb1, VOiCES19-eval, SRE16-eval, and SRE18-CMN2-eval. This observation suggests that applying segment-level attention and leveraging low spectral components can produce discriminative speaker embeddings.
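The pooling idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `attentive_stsp`, its parameters (`win_len`, `hop`, `n_low`, `attn_vec`), the Hamming window, the use of magnitude spectra, and the single linear scoring vector that stands in for a trained attention network are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def attentive_stsp(feats, win_len=8, hop=4, n_low=2, attn_vec=None):
    """Pool a (channels, frames) feature map into a fixed-size embedding.

    All names and default values are illustrative assumptions.
    """
    n_ch, n_frames = feats.shape
    if n_frames < win_len:
        raise ValueError("utterance is shorter than one window")

    # Cut every channel into overlapping windowed segments: (C, S, win_len).
    starts = np.arange(0, n_frames - win_len + 1, hop)
    window = np.hamming(win_len)
    segs = np.stack([feats[:, s:s + win_len] * window for s in starts], axis=1)

    # Short-time magnitude spectrum of each segment: (C, S, win_len // 2 + 1).
    spec = np.abs(np.fft.rfft(segs, axis=-1))

    # Segment-level attention: one score per segment, softmax-normalised.
    # Assumption: a linear scoring vector replaces a trained attention network.
    if attn_vec is None:
        attn_vec = np.full(n_ch, 1.0 / n_ch)
    scores = attn_vec @ segs.mean(axis=-1)  # shape (S,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Attention-weighted average of the segment spectra: (C, F).
    pooled = np.einsum("s,csf->cf", weights, spec)

    # Retain only the n_low lowest-frequency components of each channel.
    return pooled[:, :n_low].reshape(-1)
```

Calling `attentive_stsp(feats)` on an array of shape `(C, T)` returns a vector of length `C * n_low`, so the embedding size does not depend on the utterance length `T`.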
Similar resources
The Search for the Self in Beckett's Theatre: Waiting for Godot and Endgame
This thesis is based on the works of Samuel Beckett, one of the greatest writers of contemporary literature. Here, I have tried to focus on one of the main themes in Beckett's works: the search for the real "me" or the real self, which is a problem to be solved not only for the Beckett man but also for each of us. I have tried to show Beckett's techniques in approaching this unattainable goal, base...
Using Exciting and Spectral Envelope Information and Matrix Quantization for Improvement of the Speaker Verification Systems
Speaker verification from a few spoken words or sentences has many applications. Many methods, such as DTW, HMM, VQ, and MQ, can be used for speaker verification. We applied MQ for its precise, reliable, and robust performance and its computational simplicity. We also used pitch frequency and log gain contour to further improve system performance.
Aggregating Frame-level Features for Large-Scale Video Classification
This paper introduces the system we developed for the Google Cloud & YouTube-8M Video Understanding Challenge, which can be considered as a multi-label classification problem defined on top of the large scale YouTube-8M Dataset [1]. We employ a large set of techniques to aggregate the provided frame-level feature representations and generate video-level predictions, including several variants o...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2022
ISSN: 2329-9304, 2329-9290
DOI: https://doi.org/10.1109/taslp.2022.3153267